Abstract. Molecular genetic tools have been a boon to arachnologists for decades and used to study many unique aspects
of arachnid biology including genomics, phylogenetics, population genetics, and biogeography. These tools have evolved 
over time and now provide myriad methods for exploring evolutionary questions. Early tools, while still useful under the 
proper circumstances, are giving way to a new generation of DNA sequencing technologies. These new platforms yield 
impressive amounts of data at a fraction of the cost of traditional techniques. Herein, we discuss the history and future of 
molecular evolutionary arachnology in terms of available genetic/genomic tools and their potential applications, strengths, 
weaknesses, and relative costs. Next-generation sequencing (NGS) platforms are varied in their methods and potential uses, 
making high-throughput sequencing studies focusing on a wide array of questions tractable. To date, relatively few studies 
have employed NGS technologies using arachnids, but many could benefit from using them. Because no model species exist 
within the class Arachnida, we have a limited understanding of arachnid genomics. With the ever-advancing nature of 
sequencing technologies and bioinformatics, arachnologists can relatively easily implement NGS studies to bridge the gaps 
in our understanding and open avenues for deeper and more powerful experiments. To this end, we discuss examples of 
applications of NGS technologies focusing on arachnid taxa. Despite the allure of acquiring massive quantities of sequence 
data, we should recognize the limitations of existing NGS technologies and not forsake pre-NGS methods when these 
technologies could adequately address our questions. 

Keywords: Next-generation sequencing, genome, transcriptome, phylogeny, population genetics, genomics, adaptation,
selection

1. INTRODUCTION

Since Linnaeus and before, scientists have sought to put 
order into the diversity of life, the thirst for information 
increasing with the recognition of the role of evolutionary 
processes in shaping that diversity. The advent of molecular 
techniques in the 1980s introduced a huge diversity of novel 
markers for assessment of phylogenetic affinities. Moreover, 
with the growth of the human genome project, the potential 
use of vast numbers of genes across the genome was soon 
recognized (Jones 1991). Analytical tools were developed that 
could use the almost limitless data to address questions 
ranging from historical and recent demographic and migratory
patterns to identifying signatures of recent natural selection
(Nielsen 2010; Rasmussen et al. 2011; Lohmueller 2011). Here, 
we examine where the field of arachnology stands within the 
genomic revolution. 

In recent years, advances in sequencing technology have led 
to great increases in genomic resources for many non-model 
species. The arthropod class Arachnida, encompassing over 
100,000 nominal species classified into 12-13 traditional 
orders (Krantz and Walter 2009; Blick and Harvey 2011), 
comprises a diverse array of taxa that serve key functions in 
terrestrial ecosystems as important predators and decomposers.
Much of their diversity is unique to particular arachnid
taxa, including complex silk production (spiders and mites), 
venom composition (spiders, scorpions, and pseudoscorpions) 
and detoxification of plant compounds (Grbic et al. 2011), and 
has long been of particular interest to researchers. As an 
example, spiders have been used to study behavior (reviewed 
by Herberstein and Hebets 2013), development (e.g., Kanayama
et al. 2010; Wolff and Hilbrant 2011; Mittmann and
Wolff 2012), sexual selection (e.g., Kuntner et al. 2009; Su et 
al. 2011), genetics (reviewed by Goodacre 2013), evolutionary 
ecology (reviewed by Moya-Laraño et al. 2013) and biogeography
(reviewed by Gillespie 2013), among other fields.
However, relationships within and among arachnid taxa are, 
in many cases, not presently resolved (Giribet and Edgecombe 
2013), and a paucity of genomic resources has hindered efforts 
in various fields of arachnology. Molecular techniques have 
been successfully employed to investigate a number of these 
issues, but some problems require data at a scale heretofore 
unavailable to arachnologists. 

Here we first briefly review how more “traditional” 
molecular tools have been applied to arachnid biology and 
then go on to discuss emerging next-generation sequencing 
(“NGS”) applications and their potential impact within the 
field. Instead of deeply discussing each area of arachnology 
and thoroughly reviewing the literature, we provide a brief 
background with select citations and proceed to describe some 
ways in which the rapidly multiplying set of NGS tools may be 
used. 

2. “TRADITIONAL” MARKERS OF NUCLEAR GENETIC VARIATION

Prior to genomic tools, multiple techniques were used for 
examining variation in nuclear DNA and hence to assess 
geographic and population structure, the most widely used 
being randomly amplified polymorphic DNA (RAPDs), 
restriction fragment-length polymorphisms (RFLPs), and 
satellite and microsatellite DNA. 


2.1. Allozyme electrophoresis.—Allozyme electrophoresis
has proven very useful for the analysis of geographic structure 
of arachnids. Some early studies focused specifically on 
population structure (Porter & Jakob 1990; Steiner et al. 
1992; Smith & Engel 1994; Hudson & Adams 1996; Smith & 
Hagen 1996; Boulton et al. 1998), but allozymes have also 
been used to examine questions of relatedness among colonies 
of social spiders (Johannesen et al. 1998; Johannesen & 
Lubin 1999, 2001; Johannesen & Veith 2001; Evans & 
Goodisman 2002; Yip et al. 2012), paternity (Schafer & Uhl
2002), species boundaries and speciation (Piel & Nutt 2000; 
Ramirez & Chi 2004), dispersal (Pedersen & Loeschcke 2001; 
Schafer et al. 2001), the effects of forest fragmentation, 
whether natural (Vandergast et al. 2004) or manmade 
(Ramirez & Haakonsen 1999; Gurdebeke et al. 2000), and 
to estimate selection on color polymorphisms (Tso et al. 
2002; Oxford 2005; Oxford & Gunnarsson 2006; Croucher et 
al. 2012) as well as patterns of diversification within rapidly 
diversifying lineages (Pons & Gillespie 2004; Baert et al. 
2008; De Busschere et al. 2010). 

The primary limitations of allozyme electrophoresis are: a)
organisms must generally be alive or deep-frozen before use;
b) when bands co-migrate, they are assumed to be homologous;
c) only a very small subset of the genetic variation at a
given locus is revealed; and d) it is not possible to distinguish 
ancestry and descent among different alleles. The technique is 
inexpensive, fast, and can give insight into multiple loci and so 
is useful for addressing questions of geographic structure. 

2.2. Satellite & microsatellite DNA.—Tandem repeats 
include three subclasses: satellites, minisatellites and microsatellites.
Satellites range in size from 100 kb to over 1 Mb
with repeat units of ca. 100-200 bp; most are located at the
centromere. Minisatellites range from 1 kb to 20 kb in size
with shorter repeats (9-80 bp), while microsatellites (also 
known as short tandem repeats, STR), are repeats of 
sequences less than about five base pairs in length (an 
arbitrary cutoff). Among spiders, satellite DNA has proven
very useful in assessing relationships among species within a
Hawaiian radiation because the tandemly arranged
units show high intraspecific sequence identity due
to concerted evolution (Pons & Gillespie 2003, 2004). As a
result, the length of the branches and corresponding support 
were much greater for satellite DNA than for mtDNA 
sequence data. 

Microsatellites are repeated short sequences of DNA that 
occur throughout the genomes of many organisms, including 
spiders. Because repeat units are readily added to or lost from 
microsatellite DNA, the sequence length of these regions 
evolves rapidly. Microsatellites offer a valuable pool of genetic 
variation that has proven very useful for looking at paternity 
and relatedness among spiders, including social species (Ji 
et al. 2004; Bilde et al. 2009; Duncan et al. 2010), as well as 
understanding geographic structure between closely related 
populations (Rutten et al. 2001; Reed et al. 2007, 2011; 
Krehenwinkel & Tautz 2013; Parmakelis et al. 2013). 
However, compared to many other fields, the development
and application of microsatellites in spider ecology and
evolution have been limited. A potential cause of the paucity
of microsatellite studies in arachnids is the apparent difficulty
in finding reliable loci. The low %GC (percentage of
guanine and cytosine residues in DNA sequences) in some
lineages, as discussed below, could also play a role.
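
The two quantities discussed above, %GC and tandem-repeat content, are straightforward to compute from sequence data. The following is a minimal Python sketch and not a method taken from the studies cited: the function names, the regular-expression approach, and the threshold of four tandem copies are illustrative assumptions, and the five-base limit on repeat-unit length simply follows the arbitrary cutoff mentioned in the text.

```python
import re


def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def find_microsatellites(seq, max_unit=5, min_repeats=4):
    """Return (start, repeat_unit, copies) for each perfect tandem repeat.

    max_unit and min_repeats are assumed thresholds chosen for illustration.
    The lazy quantifier makes the regex report the shortest repeat unit.
    """
    seq = seq.upper()
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Toy sequence (hypothetical): a dinucleotide repeat flanked by unique sequence.
example = "GGTT" + "AC" * 6 + "GGTT"
print(gc_percent(example))
print(find_microsatellites(example))
```

A scan of this kind only flags perfect repeats; whether a candidate locus is polymorphic and amplifies reliably still has to be established empirically.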

2.3. Random amplified polymorphic DNA (RAPD).—In the 
RAPD procedure, a single short oligonucleotide primer (8-10 nucleotides
long) is used to amplify random sections of nuclear DNA,
with differences in band sizes being used to provide
information on relationships. The method has been used in
spiders (e.g., A’Hara et al. 1998; Gurdebeke et al. 2003).
However, although the approach provides a lot of variability, 
RAPDs suffer from poor repeatability, lack of codominance, 
and the possibility of non-heritable or non-homologous 
bands. 

2.4. Restriction fragment length polymorphisms (RFLP).—
For generating RFLPs, regions of nuclear DNA isolated 
through PCR or other means can be digested with restriction 
enzymes that cut samples of homologous DNA at specific 
four- or six-base sequences, differences arising from the 
locations of restriction enzyme sites. This technique could, 
compared to other technologies of the time, exploit an 
enormous amount of genetic variation. However, although 
used in mites (e.g., Osakabe & Sakagami 1994), it was never an
important technique in other arachnid groups. 
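
The logic of an RFLP comparison can be illustrated with a short in silico digest. This is a hypothetical sketch: the `digest` function and the two toy alleles are invented for illustration, and EcoRI (recognition sequence GAATTC, cutting after the first G) is used only as a familiar six-base cutter. Only one strand of a linear fragment is considered.

```python
def digest(seq, site, cut_offset=0):
    """Return fragment lengths after cutting a linear sequence at every site.

    cut_offset is the position of the cut within the recognition site,
    counted from the first base of the site.
    """
    seq, site = seq.upper(), site.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]


# Two toy alleles (hypothetical) that differ by one substitution in the site.
allele_a = "A" * 10 + "GAATTC" + "T" * 14   # site present
allele_b = "A" * 10 + "GAGTTC" + "T" * 14   # site lost

print(digest(allele_a, "GAATTC", cut_offset=1))  # two fragments
print(digest(allele_b, "GAATTC", cut_offset=1))  # one uncut fragment
```

The differing fragment-length profiles of the two alleles are what would appear as different banding patterns on a gel.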

2.5. Amplified fragment length polymorphisms (AFLP).—
AFLPs use restriction enzymes to digest genomic DNA, with 
the fragments then amplified and separated, providing 
markers across many loci that are highly variable and are 
also reproducible. Like RAPDs, however, they are also both 
anonymous and dominant and may produce non-homologous 
bands. They have been used in studies of geographic structure 
among spider populations (Jung et al. 2006; Lambeets et al. 
2010; Croucher et al. 2011a, b), where they provided fine 
resolution of population differentiation and subdivision. They 
have also been used in assessments of inbreeding and sociality 
(Bilde et al. 2005).

3. SEQUENCING METHODS 

3.1. Sanger sequencing of mitochondrial DNA.—Mitochondrial
DNA, notably the cytochrome oxidase 1 (CO1), NADH
dehydrogenase 1 (ND1) and 16S rRNA genes (Agnarsson et
al. 2013), proved particularly useful in the earliest studies of
biogeography and species differentiation in spiders (Gillespie 
et al. 1994; Hedin 1997a, b; Johannesen et al. 2002). This is
simply because mitochondrial DNA is far more abundant than
nuclear DNA, making it much easier to amplify. However, mtDNA
does not recombine and therefore behaves as a single locus, which
is a problem for recently evolved lineages and limits its value
for analytical approaches requiring multiple loci. This makes
its use in species delimitation particularly
problematic (Hamilton et al. 2014). Moreover, the haploid 
nature of mitochondrial DNA means that the marker is more 
sensitive to small population sizes than is nuclear DNA, and 
the maternal inheritance means that biases in movement 
between sexes cannot be recovered. For these reasons, recent
studies that have used mtDNA sequences have generally
included various nuclear markers (e.g., Vandergast et al. 2004;
Starrett & Hedin 2007; Croucher et al. 2011a, b, 2012; Satler
et al. 2013). Mitochondrial DNA has also been applied to
questions at deeper phylogenetic levels, where the microevolutionary
problems mentioned above are less severe. However,
the rapid evolution of the marker means that it tends to
become saturated rather quickly (Brewer et al. 2013).

3.2. Sanger sequencing of nuclear DNA.—Because of the
issues of amplification, the most reliable, and hence useful, 
nuclear genes have tended to be those that occur in multiple 
copies such as Histone 3, and the ribosomal 18S and 28S 
genes, and these have been of particular impact in the realm of 
phylogenetic reconstruction (reviewed by Agnarsson et al. 
2013 and Giribet and Edgecombe 2013). At the population-species
level, attention has focused on nuclear introns
(noncoding sequences within nuclear genes), as these are not
subject to the same selective constraints as exons and tend to
evolve faster (Garb & Gillespie 2009; Hedin et al. 2010). ITS
(internal transcribed spacer) regions within the ribosomal 
RNA genes can frequently provide sufficient variability at 
shallow levels (Hormiga et al. 2003), though paralogy can 
often make the identification of homologous DNA difficult 
or impossible. Indeed, the problem of generating multilocus
(nuclear) data has remained. Thus, researchers have looked
increasingly toward modern, high-throughput (or “next-generation”)
sequencing technologies, which can generate large amounts of
multilocus data by increasing the amount of data per unit cost by
orders of magnitude.

3.3. Sanger sequencing versus NGS methods.—In the past,
DNA sequence data were primarily collected using dideoxyribonucleotide
(ddNTP) termination methods (i.e., Sanger
sequencing). This approach provides long, high-quality 
sequences but suffers from a number of limitations (Table 1), 
including the ability to sequence only a single locus per 
reaction. In addition, reactions typically require taxon (or 
even population) specific oligonucleotide “primers”—short 
fragments of DNA (ca. 20-25 bp) of known sequence for 
polymerase chain reaction amplification and the sequencing 
reaction. Lastly, the cost of collecting data is much higher than 
in NGS approaches. Sanger sequencing methods are quite 
scalable in that one can easily obtain data for a single locus to 
hundreds of loci with a concomitant change in cost, but one
cannot sequence massive amounts of genomic data from
numerous specimens in a manner that is cost-effective in terms
of time and money.

In contrast to Sanger sequencing, NGS techniques provide 
vastly larger quantities of data much faster and for far less 
money. NGS approaches achieve this in two ways. The first 
involves ligating or attaching adaptors (synthesized DNA 
strands of known sequence) to the ends of fragments of target 
DNA. These adaptors allow identical sets of PCR or 
sequencing primers to be used for all the DNA fragments (a 
form of “shotgun” sequencing). From an operational point of 
view, the major differences between the various NGS 
approaches are in the size of the DNA fragment (the “insert”) 
and the number of base pairs of sequence data that can be 
recovered from the end of the fragment. The second way that
NGS approaches dramatically reduce costs is by miniaturization
and parallelization: millions of sequencing reactions take
place in small reaction chambers or flow cells (Shendure & Ji
2008). Consequently, high-throughput methods can sequence
numerous DNA fragments concurrently and in a single
reaction/run, with full-length sequences typically being assembled
after the fact.
Table 1.—Summary of the strengths, weaknesses, starting material and applications of select molecular data sources. Pre- and post-next-
generation sequencing molecular data sources discussed here are listed. Several positives and negatives are given for each technique, along with
differences in starting material required. Additionally, several historical and potential applications in arachnology are provided for each method.

| Method | Raw sequence output | Cost per Mb | Positives | Negatives | Potential in arachnids |
|---|---|---|---|---|---|
| Traditional Sanger sequencing | 1.9-84 Kb | $2,400 | Long, high quality reads; scalability | Very high cost per Mb | Traditional phylogenetics, population genetics and studies of few genes |
| 454 pyrosequencing | 0.7 Gb | $10 | Long reads | Problems with homopolymers; high cost per Mb (for next-gen technology) | Genomes, transcriptome and microbiome studies |
| Illumina (Solexa) sequencing | 600 Gb (HiSeq 2000) | $0.05-$0.15 | High output; relatively low cost; widely used and supported | Short reads | Genome, transcriptome, massively barcoded amplicon and microbiome studies |
| SOLiD sequencing | 120 Gb | $0.13 | Highly accurate | Short reads; not as widely supported | Genome, transcriptome, epigenetic and resequencing studies |
| Ion Torrent sequencing | 20 Mb-1 Gb | $1 | Scalability | Short reads | Transcriptome and barcoded amplicon studies |
| MiSeq sequencing | 1.5-2 Gb | $0.50 | Longer reads and smaller scale than older Illumina technologies | Output too low for some applications | Transcriptome, barcoded amplicon and microbiome studies |
| Single molecule real time (SMRT) | 400 Mb | $0.75-$1.50 | Very long reads | High error rate; best combined with other technologies | Genome and transcriptome studies |


3.4. NGS versatility in sequencing targets.—The “shotgun”
nature of NGS, using ligated universal adaptors, gives these 
approaches tremendous versatility in terms of what can be 
sequenced. This versatility facilitates genomic data collection 
for organisms about which little or no prior genetic 
information is known. Sequenced DNA targets can therefore 
theoretically consist of any source of DNA from total genomic 
DNA for genome sequencing (see Section 5) to cDNA (derived 
from total RNA by reverse transcription of expressed mRNA) 
(Mortazavi et al. 2008) from whole organisms or specific 
tissues that may have experienced different “treatments” 
(“transcriptomics” and “differential expression”). RNAseq 
libraries used for transcriptome sequencing target only the 
transcribed portion of the genome and are therefore a type of 
“reduced representation library” (RRL) (Van Tassell et al.
2008), and we discuss this approach along with other RRL
approaches such as Exon Capture (Bi et al. 2012) and RADseq
(Miller et al. 2007) below. RRL approaches target specific loci
and produce fewer unique reads per individual than whole
genome libraries; on certain NGS platforms, especially
Illumina (see below), they may therefore produce highly redundant
data per individual. This has resulted in methodologies that
multiplex numerous individuals using small, unique oligonucleotide
indices (i.e., tags or barcodes). These are typically
incorporated into the adaptors and allow the sequences from
each individual to be sorted computationally after sequencing.
Barcoding therefore permits NGS approaches to act as high-throughput
variant detection and genotyping platforms (e.g.,
Dahl et al. 2007; Meyer et al. 2008). Barcoding also comes into
its own when the DNA targets originate from amplicons
generated by traditional PCR approaches. Amplicons might
be derived from long-range PCR of mitochondrial genomes,
for example, or from standard molecular markers such as
bacterial 16S, fungal 18S, or metazoan CO1. Such massive
barcoding approaches with amplicon sequencing are permitting
community-wide metagenomic/microbiome analyses
(Amaral-Zettler et al. 2009; Gloor et al. 2010; Caporaso et al.
2012) and large-scale phylogenetic studies (see below).
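
The computational sorting of barcoded reads described above can be sketched in a few lines of Python. This is an illustrative simplification rather than any published pipeline: the `demultiplex` function, the barcode sequences, and the sample names are hypothetical, barcodes are assumed to sit at the start of each read and to be of equal length, and only exact matches are accepted (real tools typically tolerate sequencing errors in the barcode).

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Assign reads to samples by an exact-match leading barcode.

    reads    : iterable of read sequences (strings)
    barcodes : dict mapping barcode sequence -> sample name
    Returns a dict mapping sample name -> list of reads with the barcode
    trimmed off; reads with no recognised barcode go under "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            bins["unassigned"].append(read)
    return dict(bins)


# Hypothetical barcodes for two pooled individuals.
barcodes = {"ACGT": "spider_1", "TGCA": "spider_2"}
reads = ["ACGTGGGAAA", "TGCACCCTTT", "ACGTTTTT", "NNNNAAAA"]
for sample, seqs in demultiplex(reads, barcodes).items():
    print(sample, seqs)
```

Once reads are binned in this way, each individual can be genotyped or assembled separately even though all libraries were sequenced in a single run.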

4. NEXT-GENERATION TECHNOLOGY PLATFORMS 

Modern, high-throughput (or “next-generation”) sequencing
technologies have made many questions more tractable by
increasing the amount of data per unit cost by orders of
magnitude. Although a number of NGS techniques
have a steep learning curve (in terms of both wet-laboratory
work and bioinformatics), much can be outsourced,
and myriad computational resources (many of which
may be used free of charge) are readily available. Some of
these techniques are widely used, while others are still more
limited in their availability. All such methods have inherent
strengths and weaknesses that can be leveraged to address a
wide range of questions. As more molecular data are collected
for arachnid taxa, these groups may begin to approach
“model” organism status in terms of a foundational understanding
of genetics. This will allow more in-depth studies,
using complex and powerful genetic and genomic techniques, 
to understand the basis of arachnid-specific traits. 

4.1. Second-generation NGS technologies.—The basic logic
behind several NGS technologies has been reviewed in recent 
works (Liu et al. 2012; McCormack et al. 2013; Quail et al. 
2012). Therefore, we do not delve into specifics of the 
approaches, instead choosing to highlight strengths and 
weaknesses of the platforms. Many of the points below are 
summarized in Tables 1 and 2. 

Table 2.—Comparison of DNA sequencing technologies with an emphasis on uses in arachnology. Several aspects of common sequencing
technology are compared. The raw sequence output is highly variable between platforms and shows the potential scalability provided by using
different sequencing platforms. The cost per one million base pairs of data between methods also differs substantially and must be considered
when attempting a study requiring sequencing. To help with choosing between platforms, we provide select positives and negatives for each.
Finally, some potential high-level applications in arachnology are given.

| Data source | Positives | Negatives | Starting material | Applications in arachnids |
|---|---|---|---|---|
| Pre next-gen sequencing | | | | |
| Allozyme electrophoresis | Ease of use; widely used in arachnology | Requires fresh or frozen material; uncertain homology | Fresh or frozen specimens | Population genetics and biogeography |
| Variable nucleotide tandem repeats (VNTR; satellites) | More certain homology; widely used outside of spiders (many analytical tools) | Difficult to design; may not work between even closely related taxa | gDNA | Paternity, population genetics and biogeography |
| Random amplified polymorphic DNA (RAPD) | Easier to implement than satellite techniques | Lack of repeatability; lack of codominance; uncertain homology | gDNA | Population genetics and biogeography |
| Restriction fragment length polymorphisms (RFLP) | Ease of use | Fragments of same size may not be homologous; not widely used in arachnology | gDNA | Population genetics and biogeography |
| Amplified fragment length polymorphisms (AFLP) | Widely used | Anonymous and dominant; uncertain homology | gDNA | Assessments of inbreeding; population genetics and biogeography |
| Termination (Sanger) sequencing | Sequence data; homology easier to infer; highly scalable compared to other non-NGS techniques | Much more expensive than NGS at a per base level; taxon/population specific primers needed | gDNA | Assessments of inbreeding, population genetics, biogeography, phylogenetics and single gene studies |
| Post next-gen sequencing | | | | |
| Genome sequencing | Full sequence data, creates foundation for many future studies | Costly, difficult, and unnecessary for many projects | High quality, high quantity gDNA | Genome studies, genetic mapping and developing model organisms |
| Transcriptome sequencing | Easy to sequence and serves a wide range of projects | Costly and biased towards coding regions of genome | High quality RNA from fresh or frozen specimens | Studying coding sequences, differential expression, identifying isoforms, evolution of gene families, functional genes, and deep phylogenomics |
| RAD Tags | Powerful for genetic mapping, population genetics and phylogeography | Requires large amount of high-quality DNA | High quality, high quantity gDNA | Phylogeography, population genetics, species-level phylogeny and genomic mapping |
| Target Capture | Targets portions of the genome for wide array of projects | Requires additional genomic information | gDNA from fresh or preserved specimens (quality depends on application) | Studies of specific regions of the genome, functional genes, population genetics, species-level phylogeny and phylogeography |
| Anchored enrichment | Orthology certainty and phylogenetics at various taxonomic levels | Not developed in all groups | gDNA from fresh or preserved specimens (quality depends on application) | Phylogenetics of various taxonomic depths |

The first mainstream high-throughput technology was the
Roche 454 system. This method relies on pyrosequencing
chemistry to obtain millions of unique reads. Roche’s 454
approach provides relatively long reads (~700 bp) at a much
lower cost (~$10/Mb) than traditional sequencing methods
(~$2,400/Mb). The 454 sequencing technology, although
expensive per base of data in comparison to other NGS
methods and yielding fewer unique sequence reads, is still
widely used in genome and transcriptome sequencing and
metagenomics because of the relatively long length of
individual reads. However, other techniques (namely Illumina
technologies, see below) can now serve many of the same
functions as 454 sequencing at a much lower cost while
providing many more unique sequence reads.

The second NGS platform to become widely used was 
Applied Biosystems sequencing by oligo ligation detection 
(SOLiD). This method uses the ligation of short probes to the 
template DNA. Each probe’s extension relies on two-base 
matches, yielding highly accurate results at a lower cost
(~$0.13/Mb) and in higher quantities than 454 sequencing (see
below). A minor downside to the SOLiD sequencing platform
is that the output is in a format unlike other technologies and
requires computationally expensive algorithms to assemble.
Nonetheless, this method can be used to efficiently study 
genomes, transcriptomes, and epigenetics (i.e., non-genetic 
modifications of the DNA sequence that affect expression 
such as methylation of CpG “islands”, areas of the genome 
containing high frequencies of cytosine and guanine residues). 

The last of the commonly used second-generation technologies,
and the most frequently used NGS platform, is the
Illumina system. Illumina chemistry relies on fixed flow-cell 
binding site oligonucleotides and complementary adaptors 
that also contain sequencing primer sites, and that are ligated 
to the DNA fragments to be sequenced. This technology yields 
a vast quantity of raw data (600 Gb) for relatively low cost 
($0.05 - $0.15/Mb), and much effort has gone into developing 
novel ways of applying the method to a wide array of studies. 
These range from tweaking protocols to creating new 
algorithms and software for analyses. The main downside of 
the Illumina technology is that the sequence reads are 
relatively short. Early versions of the platform yielded reads
that were only 36 bp in length, but read length and
throughput continue to increase for all the NGS technologies;
the Illumina platform, for example, can now generate
reads in excess of 150 bp, although these are still short in
comparison to Sanger and 454 reads. Short reads lead to
complications in genome sequence assembly efforts and in
community/microbiome sampling where assembly of sequences from a mixed
pool of taxa is problematic. Fortunately, several approaches
have been developed to address these problems including 
combining Illumina data with other sources, using large insert 
sizes for scaffolding, and overlapping reads for metagenomic 
amplicon sequencing (Masella et al. 2012). Therefore, Illumina 
sequencing is often used in genome sequencing efforts, 
transcriptomics, community sampling, resequencing, target
enrichment, and many other applications.
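
The per-megabase figures quoted in this section translate into very different budgets for a given amount of raw data. The short Python sketch below makes the arithmetic explicit. The costs are the approximate values cited in the text; the function name, the dictionary layout, and the 1 Gb (taken as 1,000 Mb) target are illustrative assumptions, and the calculation ignores library preparation, labor, and instrument costs.

```python
from math import log10

# Approximate raw-data cost per Mb in US dollars, as (low, high) ranges,
# using the values cited in the text.
COST_PER_MB_USD = {
    "Sanger": (2400.0, 2400.0),
    "454": (10.0, 10.0),
    "SOLiD": (0.13, 0.13),
    "Illumina": (0.05, 0.15),
}


def sequencing_cost(megabases, platform):
    """Return the (low, high) raw-data cost in USD for the given output."""
    low, high = COST_PER_MB_USD[platform]
    return megabases * low, megabases * high


target_mb = 1000  # hypothetical project: 1 Gb of raw sequence
for platform in COST_PER_MB_USD:
    low, high = sequencing_cost(target_mb, platform)
    print(f"{platform:>8}: ${low:,.0f} to ${high:,.0f}")

# Orders of magnitude separating Sanger from the upper Illumina estimate.
print(round(log10(2400.0 / 0.15), 1))
```

Under these figures the gap between Sanger and Illumina sequencing is roughly four orders of magnitude, which is the sense in which NGS is said to increase the amount of data per unit cost.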

4.2. Compact personal genome sequencers.—In an attempt to 
down-scale high-throughput sequencing technologies to provide
a more manageable amount of data for less money,
“personal genome sequencers” have been developed. These 
machines are less expensive to buy, use, and maintain; hence 
individual labs may realistically own these machines for smaller 
scale and exploratory sequencing experiments. The first of 
these, the personal genome machine (PGM), was released by 
Ion Torrent (Rothberg et al. 2012). This platform is unique in its
scalability. Sequencing takes place on individual disposable chips 
that can collect variable amounts of sequence data for differing 
levels of cost. This method is used for small genome sequencing 
(e.g., organellar or prokaryotic genomes) and transcriptomes. 
The Illumina MiSeq is similar in application to the larger Illumina 
platform, but at a smaller scale. Sequencing and data analysis are 
integrated into a single machine and can yield analyzed data in a 
single day. This method is commonly applied in highly 
multiplexed amplicon sequencing, small genome sequencing, 
microbial community analysis (Caporaso et al. 2012) and for the
identification of transcription factor binding sites (i.e., ChIP-Seq).

4.3. Third-generation NGS technologies.—The newest high-throughput
sequencing platforms, which sequence single molecules,
include two main technologies: single molecule real-time
(SMRT) sequencing by Pacific Biosciences (PacBio) and the
unreleased Nanopore platform (Oxford Nanopore Technologies,
Oxford, UK). These methods are characterized by two
main features: 1) no PCR prior to sequencing (limiting
artifacts) and 2) sequences are recorded in real-time (i.e.,
during the polymerase reaction or depolymerization). These
methods can each yield very long reads (>5 Kb and up to 13-14 Kb
for PacBio), making them useful in de novo genome
sequencing efforts. Short reads from the PacBio SMRT 
technology can be highly accurate since the platform has the 
ability to resequence the circularized molecule repeatedly until 
base confidences are high; however, long reads have a very 
high endemic error rate (ca. 15%). Approaches are being 
developed to correct the SMRT data using large quantities of 
accurate but short-read Illumina data (English et al. 2012; 
Koren et al. 2012). The Nanopore technology is currently not 
widely available, so much is still unknown concerning its 
performance. Moreover, although SMRT methods provide 
much longer reads than earlier NGS approaches with 
considerable simplification of library preparation, neither is 
currently well supported by common NGS bioinformatics 
tools. 
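
Why repeated passes over a circularized molecule raise accuracy can be seen with a simple probability model. The sketch below is a deliberately simplified illustration and not a description of the PacBio consensus algorithm: it assumes that each pass miscalls a given base independently with the same probability (here the ca. 15% figure quoted above) and that the consensus is a majority vote, whereas real single-molecule errors are dominated by insertions and deletions and are resolved through alignment. The function name is hypothetical.

```python
from math import comb


def majority_vote_error(per_pass_error, passes):
    """Probability that more than half of the passes miscall a base.

    Under the independence assumption this is an upper bound on the error
    of a majority-vote consensus, because wrong calls need not agree.
    """
    if passes < 1 or passes % 2 == 0:
        raise ValueError("use an odd, positive number of passes to avoid ties")
    p, q = per_pass_error, 1.0 - per_pass_error
    need = passes // 2 + 1  # fewest wrong passes that outvote the correct ones
    return sum(comb(passes, k) * p**k * q**(passes - k)
               for k in range(need, passes + 1))


for n in (1, 3, 5, 9):
    print(n, round(majority_vote_error(0.15, n), 5))
```

Under this model the bound falls from 15% for a single pass to about 6% with three passes and under 3% with five, which is why short inserts that can be read many times yield far more accurate consensus sequences than long inserts read only once.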

5. ARACHNID GENOME EFFORTS 

NGS technologies allow the sequencing and reconstruction 
or “assembly” of whole genome sequences. Accurate genome 
assembly in model organisms (organisms that are amenable to 
genetic study, have short generation times, breed in large 
numbers, and can inform about other organisms) has 
traditionally relied upon an edifice of classical genetics 
resources including inbred lines to minimize genetic variation, 
genetic linkage maps generated from laboratory crosses 
among inbred lines, and the sequencing and hierarchical or
clone-based assembly of large (40-200 kb) genome fragments
cloned into large-insert libraries such as “bacterial artificial
chromosomes” (BACs) (Lander et al. 2001). Arachnids, like most non-model
organisms, lack most of these resources. They often have long 
generation times and can be very small (forcing pooling of 
individuals and an increase in heterozygosity, making 
assembly difficult). Moreover, they are generally difficult to 
breed in captivity and, except for some mite species, no inbred 
lines are available, with the possible exceptions of naturally 
inbred social species such as the eresid Stegodyphus
mimosarum (J.S. Bechsgaard & T. Bilde pers. comm.) and the
theridiid Anelosimus eximius (I. Agnarsson pers. comm.).

5.1. Published genomes.—The three presently available 
arachnid genomes are from highly derived acarine species: 
the two-spotted spider mite Tetranychus urticae (Grbic et al. 
2011), the honey bee ectoparasitic mite Varroa destructor 
(Cornman et al. 2010) and the deer tick Ixodes scapularis
(http://iscapularis.vectorbase.org). The choice of these
arachnids as early targets for genome sequencing is perhaps
unsurprising; Tetranychus and Varroa are of tremendous 
agricultural and economic importance, and Ixodes is of great 
importance as a vector of numerous livestock and human 
diseases including Lyme disease. In addition to its economic 
importance, Tetranychus urticae was selected as a candidate 
for genome sequencing as it has the smallest known genome of 
any arthropod at a mere 89.6 Mbp (Grbic et al. 2011), is easily 
cultured in the laboratory and has inbred lines available. The 
small Tetranychus genome was sequenced using traditional 
Sanger sequencing methods to a depth of 8.05X, resulting in 
640 scaffolds; 70,778 EST sequences plus RNA-seq data (see
below) were mapped to the genome and supported 15,397 of
18,414 gene models. The genome of the ectoparasitic mite
Varroa destructor, which has emerged as the primary pest of
domestic honey bees (Cornman et al. 2010), was “surveyed” 
using 4.3X coverage of 454 sequence data from the DNA of 
1,000 pooled mites. This 2.4 Gbp was clearly insufficient to 
provide a comprehensive de novo assembly of this moderately 
sized genome (at 294 Mbp still far bigger than most sequenced 
insects) and yielded 184,094 contigs (assembled contiguous but 
not “scaffolded” sequences) with an N50 (weighted median of 
contig lengths) of 2,626 bp; however, the data were sufficient 
to permit the prediction of 31.3 Mbp of gene sequence, 
information about the integration of microbes into the 
genome and the occurrence of single nucleotide
polymorphisms (Cornman et al. 2010). Finally, the genome of the deer
tick Ixodes scapularis, which is very large compared to
Tetranychus and Varroa at 2.1 Gbp, was shotgun sequenced 
using Sanger sequencing to a coverage of 3-6X. Although 
many data on the expressed gene sequences (i.e., the 
transcriptome) are available in the public databases, the 
genome sequence remains highly fragmented (e.g., ca. 571,000 
contigs with a contig N50 of 3000 bp) and has not been 
officially published (http://iscapularis.vectorbase.org). Of the 
three acarine species, Tetranychus provides the most complete 
genome reconstruction, with genome assemblies for Varroa 
and Ixodes remaining highly fragmented. 
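
Because contig N50 is used throughout this section to compare assemblies, a minimal sketch of how it can be computed may be helpful. The function name and the toy contig lengths are illustrative assumptions and are not taken from any of the assemblies discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length of the contig at which the running total,
    taken from the longest contig downward, first reaches half of
    the summed length of all contigs.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


# Toy (invented) contig lengths, not data from any real assembly.
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))
```

Because the statistic is weighted by length, a few long contigs can raise the N50 considerably even when most contigs are short.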

5.2. Genomes in progress - what have we learned?—Apart 
from the three acarine species discussed above, our knowledge 
of the nuclear DNA structure of arachnids remains extremely 
limited. Most knowledge about arthropods comes from 
insects—a reflection of biological diversity, societal impact 
and economic and medical importance, and the scale of the 
research community, among other factors. From an evolutionary
and phylogenetic perspective, this bias of course
does not reflect relative importance. However, efforts such
as the research community-driven I5K project (http://www.
arthropodgenomes.org/wiki/Main_Page) that aims to sequence 
5,000 arthropod genomes over five years should redress the 
balance to some extent. Even so, of 787 species currently 
nominated for sequencing, there are 702 Hexapoda (89%), 64 
Chelicerata (8%), only 20 Crustacea (2%), and 6 Myriapoda 
(1%) (http://www.arthropodgenomes.org/wiki/i5K_nominations).
Several arachnids have been included in the pilot sequencing 
project of the I5K, and these are discussed below, together with 
our own efforts on Theridion (Theridiidae) and other efforts on 
Stegodyphus (Eresidae) and Acanthoscurria (Theraphosidae).
In addition, the genome of Limulus has been sequenced, and a
preliminary assembly is about to be publicly released (Nipam
Patel pers. comm.). However, pre-NGS methods have revealed
much about arachnid mitochondrial genomes, and we briefly
review this here before going on to examine nuclear genomes.

Mitochondrial genomes: Although knowledge of the arachnid
nuclear genome remains in its infancy, several decades of
research, based upon traditional PCR and Sanger sequencing,
have yielded detailed knowledge of arachnid mitochondrial
genomes. This work has revealed lineage-specific gene order
rearrangements in Opiliones (Masta 2010) and pseudoscorpions
(Ovchinnikov & Masta 2012), and most interestingly has
revealed truncated mitochondrial tRNA (and rRNA) secondary
structures among most arachnid lineages (Masta & Boore
2004, 2008; Masta et al. 2008; Fahrein et al. 2009; Masta 2010;
Ovchinnikov & Masta 2012). NGS technologies can potentially
greatly increase our understanding of the sequence
diversity, variation and transcriptional mechanisms among
arachnid mitochondria since 1) whole mitochondrial genomes
can rapidly be sequenced from many barcoded and pooled
individuals using amplicon sequencing (e.g., on small scale
MiSeq or IonTorrent systems) (see below); 2) mitochondrial
genomes can be assembled from total genome sequence data
(Iorizzo et al. 2012); and 3) RNA-seq reads (Illumina-based
method of sequencing cDNA obtained via reverse transcription
of mRNA extractions) can be mapped to mitochondrial
genes to explore expression differences among genes and taxa
and post-transcriptional modification and editing of gene
sequences (Smith 2013).

Nuclear genomes: Since no non-acarine genomes have been 
published so far, detailed discussion of their structure is not 
yet possible. Our own efforts at sequencing a spider genome 
have focused on the Hawaiian happy face spider Theridion
grallator and primarily have used Illumina paired end data
based upon a variety of insert sizes. Initial assemblies were 
highly fragmented (resulting in many contigs of short length; 
i.e., a low “contig N50”). Although this is partly due to 
heterozygosity (no inbred lines are available), the main 
complication appears to be that this species has a low average 
% GC across the genome (ca. 28%) (Fig. 1). Although most 
arthropod genomes are somewhat “AT-rich” (e.g., the honey 
bee Apis mellifera has 34.8% GC; The Honeybee Genome
Sequencing Consortium 2006), the only arthropod genome
with a % GC in the range we have found is that of the pea
aphid Acyrthosiphon pisum at 29.6% GC (The International
Aphid Genomics Consortium 2010). 

A potential extreme % GC bias in arachnid genomes is both 
intriguing and technically challenging from both an informatic 
and molecular biological point of view. In order to investigate 
this further, we have examined the assembled contigs data, 
where available, from the pilot runs for the I5K project (http:// 
www.arthropodgenomes.org/wiki/Main_Page). In total we 
have examined the contig N50 length and % GC from two 
I5K sequenced spiders, Latrodectus hesperus and Parasteatoda
tepidariorum (Theridiidae), and the Arizona bark scorpion
Centruroides sculpturatus (Buthidae), together with 15 other
arthropods (14 insects and one copepod; ftp://ftp.hgsc.bcm.
edu/I5K-pilot/), and these are plotted in Fig. 1. In addition we
have included our data from Theridion grallator (PJPC
unpubl. data) and data from Stegodyphus mimosarum
(Eresidae: J.S. Bechsgaard & T. Bilde, pers. comm.). The
scorpion and the three theridiid spiders (L. hesperus, P.
tepidariorum and T. grallator) all have less than 30% GC and
correspondingly low contig N50 lengths. In general, the lower
the % GC, the shorter the contig N50 as a simple function of
decreased information content available to the assembly
algorithms. Interestingly, the P. tepidariorum sequenced had
been through five generations of inbreeding (A. McGregor
pers. comm.)—apparently sufficient to reduce heterozygosity
enough to substantially increase contig lengths despite a low %
GC.
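
The two quantities compared in this paragraph can be illustrated with a short sketch. The first function computes a length-weighted % GC from a set of contigs. The second gives the Shannon information per base under the simplifying assumption (ours, for illustration only) that bases are independent and that G and C, and A and T, are equally frequent. Function names are hypothetical, and real genomes violate these assumptions, so the entropy value is only a rough guide to why skewed base composition leaves less information for an assembler.

```python
import math


def gc_percent(contigs):
    """Length-weighted %GC across contigs, ignoring N and other ambiguity codes."""
    gc = at = 0
    for seq in contigs:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    if gc + at == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / (gc + at)


def base_entropy_bits(gc_fraction):
    """Shannon entropy per base (bits), assuming independent bases with
    P(G) = P(C) = gc_fraction / 2 and P(A) = P(T) = (1 - gc_fraction) / 2."""
    probs = [gc_fraction / 2, gc_fraction / 2,
             (1 - gc_fraction) / 2, (1 - gc_fraction) / 2]
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Toy sequences, invented for illustration.
print(gc_percent(["ATGC", "AATT"]))
# Compare an unbiased composition with the ca. 28% GC reported above.
print(base_entropy_bits(0.50), base_entropy_bits(0.28))
```

Under these assumptions the entropy is at its maximum of two bits per base at 50% GC and falls as composition becomes more skewed in either direction.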

In contrast, S. mimosarum did not exhibit an extreme %
GC bias (34% GC) and as a social species (Mattila et al. 2012)
is somewhat naturally inbred—and has a correspondingly




THE JOURNAL OF ARACHNOLOGY 





Figure 1. Assembly contig N50 length (bp) is negatively correlated with average genome-wide %GC bias among arthropod taxa. As average 
%GC decreases so does the contig N50 (weighted median of contig lengths), since lower information content leads to more fragmented 
assemblies. Open diamonds refer to arachnid genome projects and closed diamonds refer to other arthropods. Theridion grallator data are
calculated from the authors’ own (unpublished) preliminary genome assembly and have the lowest %GC to date (28.37%). Stegodyphus
mimosarum values from J.S. Bechsgaard (pers. comm.). C. darwini and N. clavipes values from I. Agnarsson (pers. comm.). All other values
estimated from initial contig (not scaffolded) assemblies of the I5K initiative pilot genome assemblies; data and assembly parameters are 
therefore similar among species (ftp.hgsc.bcm.edu/I5K-pilot/). Although all arthropods show a bias toward low %GC (<32%), the theridiid 
spiders T. grallator, L. hesperus and P. tepidariorum, as well as the scorpion C. sculpturatus, have very low %GC. S. mimosarum has a more
moderate bias (34% GC), and both this species and P. tepidariorum show the benefit of inbreeding and low heterozygosity and have longer contig
N50 lengths than the other arachnids. Additionally, the remaining non-theridiid spiders, C. darwini and N. clavipes, have moderate %GC values
but low contig N50s (<10,000 bases), possibly due to heterozygosity stemming from the lack of inbreeding. The other included arthropods are 1) Athalia
rosae (turnip sawfly: Insecta: Hymenoptera); 2) Ceratitis capitata [Mediterranean fruitfly (medfly): Insecta: Diptera]; 3) Orussus abietinus 
(parasitic wood wasp: Insecta: Hymenoptera); 4) Cimex lectularius (bed bug: Insecta: Hemiptera); 5) Anoplophora glabripennis (Asian long- 
horned beetle: Insecta: Coleoptera); 6) Libellula fulva (scarce chaser: Insecta: Odonata); 7) Helicoverpa punctigera (Australian bollworm: Insecta: 
Lepidoptera); 8) Ephemera danica (green drake mayfly: Insecta: Ephemeroptera); 9) Agrilus planipennis (emerald ash borer: Insecta: Coleoptera);
10) Copidosoma floridanum (chalcid wasp: Insecta: Hymenoptera); 11) Homalodisca vitripennis (glassy-winged sharpshooter: Insecta: Hemiptera); 
12) Leptinotarsa decemlineata (Colorado potato beetle: Insecta: Coleoptera); 13) Eurytemora affinis [copepod: Maxillopoda (Crustacea): 
Calanoida]; 14) Limnephilus lunatus (caddisfly: Insecta: Trichoptera); 15) Pachypsylla venusta (hackberry petiole gall psyllid: Insecta: Hemiptera). 


much better assembly contiguity (J.S. Bechsgaard & T. Bilde 
pers. comm.). Furthermore, initial sequencing of the huge 
(6 Gbp) genome of the Brazilian white knee tarantula 
Acanthoscurria geniculata (Theraphosidae) indicates that this 
species has a ca. 40% GC genome content (J.S. Bechsgaard & 
T. Bilde pers. comm.). Until more arachnid genome sequence 
becomes available, the question as to how widespread %GC- 
bias is among arachnids will remain unclear. From the above 
data it may appear to be specific to theridiid spiders and 
Centruroides scorpions; however it is tempting to speculate 
that extreme %GC-bias may extend to other spider families 
and other arachnid orders. This possibility should be 
considered in future genome-sequencing efforts, and we note 
that transcriptome assembly (RNA-seq) is unlikely to be so 
impacted by %GC-bias, since coding regions typically do not 
exhibit such extreme biases. 

5.3. A cautionary note.—Despite the allure of NGS 
technologies, some caution is needed before embarking on a 
project to sequence an arachnid genome. Particular questions 
a researcher working on a specific taxon should pose are: 1)
Do we need a genome sequence? And if so, 2) how complete
do we need it to be? And, perhaps more fundamentally, 3) 
what level of completeness can we attain without spending an 
unreasonable amount of resources? In reality, no genome 
sequence (including human) is fully complete, and de novo 
assembled and NGS derived genomes are even less so. De 
novo assembly of short-read shotgun sequence data without 
references, such as linkage maps or BAC libraries, remains
extremely challenging. However, as the Tetranychus, Varroa,
and Ixodes projects demonstrate, a fractured assembly may 
still be useful if it is contiguous enough to build valid gene 
models. In addition to life history and often body-size 
considerations (i.e., the need for pooling individuals), intrinsic 
features of arachnid genomes—in particular, the low % GC 
content in some lineages mentioned above—might raise a 
substantial barrier to whole genome de novo assembly 
projects. 

Even though the cost of NGS sequencing continues to drop 
rapidly, depending upon the biological question, either 
classical genetic marker-based approaches (Section 3 above) 
may be cheaper, easier, and sufficient, or NGS based
alternatives to genome sequencing may be more attainable 
(e.g., transcriptome sequencing, and reduced representation 
methods). Indeed these approaches may even be best used as a 
means to rapidly develop numerous classical markers or 
identify single nucleotide polymorphisms (as discussed in 
Section 6). RNA-seq (the sequencing of cDNA libraries 
derived from extracted mRNA and hence targeting transcribed
and therefore mainly coding regions—the transcriptome)
is rapidly becoming the tool of choice in genomic
studies. This is because RNA-seq data permit one both to
build gene models rapidly and to measure “digital” gene 
expression among taxa and tissues; consequently the technique 
has many potential applications. 

Although “complete” genome sequences, even fragmented 
ones, will yield fascinating information about genome 
structure (repeats, transposons, translocations, etc.), to be of 
greatest functional utility genomes must be annotated. While 
computational annotation of gene models is possible (although
of course not optimized for arachnids), most
annotation schemes work best when supported by sequence 
evidence. Again, RNA-seq and transcriptome data are of 
greatest utility here and thus should also be generated for the 
taxon whose genome is sequenced. Since RNA-seq data can be 
assembled de novo, for example using the software Trinity
(Grabherr et al. 2011), and annotated by homology searches 
(at least for genes where known homologs exist) (e.g., using 
BlastX and Blast2GO; Conesa et al. 2005), the experimenter 
must again ask whether a full genome sequence is required at 
all, and be cautious about assuming that this is a practicable 
route. 

6. APPLICATIONS OF NGS TECHNOLOGIES 
IN ARACHNOLOGY 

The number of possible applications using NGS technologies
is vast and continues to grow. Here we provide examples
of their use, most of which do not require the sequencing of 
entire genomes. There are many more potential applications 
than those discussed below, and, as new platforms and 
bioinformatic tools are developed, new avenues of research 
will open. 

6.1. Functional genomics: adaptation & selection.—Biologists
frequently seek to elucidate the relationship between environmental
parameters and organismal diversity. The potential for
detailing the genetic response of an organism to changes in the
biotic and abiotic environment is now in plain sight with the
availability of vast quantities of DNA sequence that can be 
generated by NGS technologies, in particular through the 
“assembly” of whole genome sequences. A review by Stapley 
et al. (2010) discusses the potential of high-throughput 
technologies in studies of adaptation. Whether focusing on 
coding gene sequences, differential expression of transcripts, 
identifying genomic regions experiencing linkage disequilibrium
(LD), or quantitative trait locus (QTL) mapping to detect
genomic regions under selection, established methods using 
high-throughput data exist. 

Measures of selection: When studying protein-coding loci, 
the most common method for measuring selection involves the 
ratio of nonsynonymous to synonymous changes (dN/dS or
ω). The resulting value potentially indicates the mode of
selection acting on the gene: ω = 1 (neutral evolution), ω < 1
(stabilizing selection) and ω > 1 (positive selection). By
employing a likelihood ratio test, P-values can be obtained to
differentiate between neutral and directional selection in
pairwise comparisons. Additionally, comparisons of ω between
branches in a multi-species/population phylogeny are
possible to identify genes or residues evolving differently or
similarly between clades. Inherently, ω-based tests for selection
require coding data and are best served by transcriptomic
data. Commonly used tools for analysis of these data include
PAML (Yang 2007) and HyPhy (Pond et al. 2005). Some
studies have employed ω tests in arthropods (e.g., Averof
2002; Porter et al. 2006; Viljakainen et al. 2009; Fort et al. 
2011), recently including spiders (Brewer et al. in review; Yim 
et al. in prep). 
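
The interpretation of ω and the likelihood ratio test described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the log-likelihood values are invented, and in practice they would be taken from the output of a program such as PAML or HyPhy. The sketch assumes two nested models that differ by a single free parameter (for example, ω fixed at 1 versus ω estimated), so that twice the log-likelihood difference is compared with a chi-square distribution with one degree of freedom.

```python
import math


def classify_omega(omega, tol=1e-6):
    """Map a dN/dS estimate onto the three modes named in the text."""
    if math.isclose(omega, 1.0, abs_tol=tol):
        return "neutral evolution"
    return "stabilizing selection" if omega < 1.0 else "positive selection"


def lrt_pvalue_1df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (test statistic, P-value). For one degree of freedom the
    chi-square upper tail equals erfc(sqrt(statistic / 2)).
    """
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Invented log-likelihoods, for illustration only.
stat, p = lrt_pvalue_1df(lnl_null=-1000.0, lnl_alt=-998.0)
print(classify_omega(0.2), stat, round(p, 4))
```

Tests with more than one additional parameter, such as many site-model comparisons, require the corresponding degrees of freedom and are not covered by this one-parameter shortcut.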

To collect the data necessary for investigating selection in 
coding sequences, RNAseq libraries are often generated. To 
obtain the most nearly unique sequence possible in a single 
run, the resulting cDNA libraries can be normalized by 
removing excessive copies of highly expressed transcripts to 
“equalize” the numbers with respect to the more poorly- 
expressed transcripts (Zhulidov et al. 2005), but normalizing is 
not essential. In addition to retrieving sequences, non- 
normalized RNAseq libraries provide information concerning 
the expression levels of transcripts. In order to leverage this 
information, specimens must be treated to control all variables 
so that the sources of differential expression (DE) can be 
identified. Methods to analyze expression data using RNA-seq 
data include the R packages “edgeR” (Robinson et al. 2009) 
and “DESeq” (Anders & Huber 2010). Differences in
expression of transcripts between populations or species 
indicate the evolution of coding loci involved in the expression 
of a gene or non-coding regions of the genome that affect the 
transcription (i.e., promoters, enhancers, and suppressors). 
These methods are currently being employed in Hawaiian 
Tetragnatha spiders to study differences in venom composition 
between a lineage that builds webs compared to one that does 
not build webs (Brewer et al. in review), building on earlier 
work that used protein gel electrophoresis patterns to show 
coarse differences between these lineages (Binford 2001). With 
NGS techniques, we are now able to explore the individual 
genes and relative changes in expression levels. 
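
As a rough illustration of what "relative changes in expression levels" means for count data, the sketch below scales raw read counts to counts per million (CPM) and reports a log2 fold change between two libraries. It is not the method implemented in edgeR or DESeq, which use more elaborate normalization and statistical models with replicates. The function names, the pseudocount and the gene names and counts are all invented for illustration.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts per transcript to counts per million reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library contains no reads")
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(counts_a, counts_b, pseudocount=1.0):
    """log2(CPM_b / CPM_a) for transcripts present in both libraries.

    The pseudocount (an arbitrary choice here) avoids division by zero
    for transcripts with no reads in one library.
    """
    cpm_a = counts_per_million(counts_a)
    cpm_b = counts_per_million(counts_b)
    shared = cpm_a.keys() & cpm_b.keys()
    return {g: math.log2((cpm_b[g] + pseudocount) / (cpm_a[g] + pseudocount))
            for g in shared}


# Hypothetical read counts for two lineages; gene names are placeholders.
web_builder = {"toxinA": 100, "actin": 900}
non_web_builder = {"toxinA": 400, "actin": 600}
print(log2_fold_change(web_builder, non_web_builder))
```

A single pair of libraries gives no estimate of variance, which is why replicated designs and dedicated packages are needed before any difference is called significant.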

Selection can also be examined using LD approaches, 
although this is necessarily limited to taxa where full genomes 
are available. By mapping SNPs to a reference genome, data 
obtained using reduced representation techniques (e.g., RAD- 
seq) can be used to detect regions of the genome under strong 
LD. This method has been used to identify regions of the 
genomes of stickleback populations that are resistant to 
introgression of outside genes (Hohenlohe et al. 2010). 
Unfortunately, RAD-seq methods require high quality and 
high quantity gDNA, which is often limited in small organisms 
such as many spiders, even when freshly collected (Cotoras 
unpubl. data). 
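
For readers unfamiliar with how LD is quantified, the following minimal sketch computes the commonly used r² statistic between two biallelic SNPs. It assumes phased haplotypes coded as 0/1, which is a simplification: reduced representation data are usually unphased and require estimation of haplotype frequencies first. The function name and the toy haplotypes are illustrative.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp1, allele_at_snp2) tuples,
    each allele coded 0 or 1, one tuple per phased haplotype.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one SNP is monomorphic")
    return d * d / denom


# Toy haplotypes: complete association, then none.
print(ld_r2([(1, 1), (1, 1), (0, 0), (0, 0)]))
print(ld_r2([(1, 1), (1, 0), (0, 1), (0, 0)]))
```

Scanning such pairwise values along a reference genome is one way that blocks of unusually strong LD can be located.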

Molecular basis for adaptation: Perhaps the most important 
applications of NGS technologies in arachnids relate to silk 
and venoms, two aspects of the biology of these organisms 
that provide an almost endless variety of questions relating to 
gene function. Both silks and venoms comprise complex 
combinations of highly-derived, and often highly repetitive,
proteins that serve myriad functions within and between
taxonomic groups. Both are linked to major ecological shifts 
and evolutionary modifications in a number of clades. The 
evolution of the forms and functions of spider silks has great 
potential in evolutionary studies, as well as bioengineering 
applications (Blackledge 2012; Garb 2013). Tools such as 
SMRT and Nanopore, with their long reads, could help to 
alleviate assembly issues associated with the highly repetitive 
elements and allow more detailed exploration of the diversity 
of spider silks at the genomic level. Venoms also vary greatly 
across the Arachnida and are found in several orders (e.g., 
Araneae, Scorpiones, and Pseudoscorpiones). Beyond differential
expression analyses, such as that described above,
characterization of venom cocktails and their molecular 
evolution is lacking in most groups. Most work done so far 
has focused on medically relevant species such as those in the 
spider genera Latrodectus (e.g., Garb & Hayashi 2013) and
Loxosceles (e.g., Zobel-Thropp et al. 2013) and the scorpion
genus Centruroides (e.g., Valdez-Velazquez et al. 2013). In an 
applied context, these compounds have vast potential in 
pharmacology and as pest control substances (reviewed by 
King and Hardy 2013). Moreover, as mentioned above, these 
compounds may also provide insights into the factors 
underlying adaptation and how selection acts at the transcriptional
level (Binford 2001).

6.2. Phylogenetics.—To date, most molecular studies of 
arachnids have sought to ascertain relationships between taxa. 
Until recently, assessment of the phylogenetic affinities of an 
organism required PCR amplification with degenerate primers 
followed by amplicon sequencing to study loci in distantly 
related taxa. The weakness of this approach is that rather few 
loci can be examined, limiting the resolution of the Tree of 
Life. Thus, the internal phylogeny of the subphylum 
Chelicerata, class Arachnida, and lower taxonomic levels has
remained unresolved despite numerous efforts to ascertain the 
relationships between taxa, including molecular phylogenetic 
studies (recently reviewed in Agnarsson et al. 2013; Giribet 
and Edgecombe 2013). Mitochondrial sequences have been the 
most common data source. However, for the reasons 
mentioned above (3.1), mitogenomic sequence data may not 
be appropriate for reconstructing deep arthropod relationships
(Brewer et al. 2013). For example, although the
Euchelicerata (Xiphosura + Arachnida) is almost unambiguously
recovered using nuclear loci, datasets using mitochondrial
genomic data often fail to support this relationship
(Masta et al. 2009; Rota-Stabelli 2010). 

Within the Arachnida, most molecular phylogenetic studies 
have focused on spiders, including the relationships within the 
infraorders Mygalomorphae (Hedin and Bond 2006; Bond
et al., 2012) and Araneomorphae (Blackledge et al. 2009). 
Molecular phylogenetic analyses within other orders exist, 
including the Opiliones (Hedin et al. 2010; Hedin et al. 2012; 
Burns et al. 2013), Acari (Klompen et al. 2007; Dabert et al. 
2010; Pepato et al. 2010), Scorpiones (Salomone et al. 2007; 
Borges et al. 2010; Prendini & Esposito 2010) and Amblypygi 
(Esposito et al. in review). Representing a small sampling 
of published works, all of these studies except Hedin et al.
(2012) use traditional Sanger sequencing approaches. Even 
at these lower taxonomic levels, nuclear molecular markers 
with appropriate phylogenetic signal are lacking, and primer
combinations for PCR often do not transfer between arachnid
groups, especially for species/population-level appropriate 
loci. 

High-throughput sequencing technologies provide a means 
to collect vast amounts of molecular data for many taxa in a 
timely manner and are currently used in various ways in 
phylogenetics (see McCormack et al. 2013 and Rocha et al. 
2013). The potential use of some NGS technologies in spider 
systematics was recently discussed by Agnarsson et al. (2013) 
and in Opiliones by Hedin et al. (2012). As for most non-model
organisms, the most common NGS data sources for
deep phylogenetics in arachnids are transcriptomes (Agnarsson
et al. 2013) and information generated from bait capture
techniques (for all taxonomic levels) such as anchored 
enrichment (Lemmon et al. 2012). These approaches do not 
require full genome sequences, which is especially useful given 
the potential difficulties with arachnid genome efforts 
mentioned above; moreover, the data generated provide loci
for which orthology is relatively easy to assign and that can be
used at deep taxonomic levels. Tools for the assignment of orthology
include HaMStR (Ebersberger et al. 2009), OrthoDB (Waterhouse
et al. 2012) and AGALMA (Dunn et al. 2013), while
PhyDesign (Lopez-Giraldez and Townsend 2011) can be used
to investigate the phylogenetic signal of a locus across an
ultrametric tree. Recent molecular models of evolution (e.g.,
CAT, Lartillot and Philippe 2004) and algorithms for
phylogeny reconstruction (e.g., PhyloBayes, Lartillot et al.
2009; RAxML, Stamatakis 2006; and FastTree 2, Price et al.
2010) have made phylogenomic studies much more tractable.
However, these analyses still can take weeks of computation 
time, require large amounts of computer memory, and 
demand a somewhat deep understanding of bioinformatics. 

6.3. Population genetics & phylogeography.—NGS approaches
have been widely celebrated for their potential in
providing large numbers of markers across the genome, which 
is essential for population genetic and phylogeographic 
studies. Since the per base cost is much lower than for Sanger 
sequencing, it has become economical to apply NGS 
techniques to generate traditional markers [e.g., microsatellites 
in a tetragnathid species (Parmakelis et al. 2013)]. 

Among the most useful tools for population genetics, and 
phylogenetics for that matter, are those based on reduced 
representation libraries (RRLs), which attempt to recover a 
small, random (i.e., unlinked) snapshot of the total genome. 
As a result of focusing on a small sample of the genome, the 
cost of sequencing a single individual is greatly reduced and 
yet RRL methods can still identify many thousands of usable 
single nucleotide polymorphisms (SNPs). 

RADseq is a popular method for genome-wide marker 
analysis because it reduces the complexity of the genome by 
sub-sampling at certain restriction sites, assumed to be 
homologous among taxa/specimens, to generate a single 
nucleotide polymorphism (SNP) data set. The approach is 
much like RFLPs and AFLPs, except that, instead of 
separating the fragments on a gel to recover a DNA 
fingerprint, they are sequenced (Davey et al. 2011). This 
approach can provide several SNPs from each fragment, 
multiplying the amount of data obtained from a single run. A 
recent modification of this technique uses a double-digestion 
and yields an increase in efficiency and a reduction in cost 
(Peterson et al. 2012). However, for many arachnid groups,
the RADseq method has requirements that may limit its use. 
First, a large amount of high molecular weight DNA is 
required (>2 micrograms per sample). Such high quality DNA 
is essential in order to generate fragments that result only from 
the restriction enzyme digest (i.e., library adaptors are not 
ligated to the end of randomly sheared/degraded fragments). 
Moreover, a high starting concentration of DNA is necessary, 
since the protocol involves many steps that result in the loss of 
DNA—typically only 7-15% of the starting material will be 
recovered. The issue of DNA quality can be resolved by 
preserving samples in 95% ethanol at -80 °C, RNAlater at
-80 °C, or by using fresh specimens. Standard extraction kits
using ion-exchange columns or salt precipitation should work 
well without causing undue shearing of the DNA. Ultimately, 
DNA yield depends on the organism and although large-bodied
arachnids (e.g., many mygalomorphs, scorpions,
amblypygids, or solifugids) may yield sufficient DNA, smaller
taxa may require specimens to be “pooled” together, thus 
losing individual-level data (Emerson et al. 2010). 
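
The principle of sub-sampling a genome at restriction sites can be illustrated with a simple in silico digest. The sketch below cuts a sequence at every occurrence of a recognition site; EcoRI (G^AATTC) is used purely as a familiar example and is our choice, not an enzyme specified in the protocols cited above. The function name and the toy sequence are likewise hypothetical.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut `sequence` at every occurrence of `site`.

    `cut_offset` is the position within the site after which the enzyme
    cuts (1 for EcoRI, which cuts between G and AATTC). Returns the
    fragments in order; joined together they give back the input.
    """
    seq = sequence.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[i:j] for i, j in zip(bounds, bounds[1:])]


# Toy sequence containing two EcoRI sites.
fragments = digest("AAGAATTCTTGAATTCGG")
print(fragments, [len(f) for f in fragments])
```

Only fragments that begin or end at a true restriction cut carry the adaptor-compatible ends, which is why degraded DNA with many random breaks performs poorly. As a rough guide to input requirements, recovery of 7-15% from 2 micrograms of starting DNA corresponds to about 0.14-0.30 micrograms of final material.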

An alternative RRL approach to RADseq is to use bait 
capture methods, such as Exon Capture, which, in contrast to 
RADseq, requires starting material to be randomly fragmented 
(Bi et al., 2012). The basic approach is to sequence the 
transcriptome of one individual and use those sequences to 
design small, overlapping probes that are then attached to a 
capture array (“chip”) or beads. The protocol starts with either 
naturally (i.e., degraded or historical) or intentionally fragmented
DNA, which is used to prepare DNA libraries following
a standard NGS protocol (e.g., Meyer & Kircher 2010). These 
libraries are barcoded for each individual and used in a 
hybridization experiment similar to a microarray. The number 
of individuals that can be multiplexed in these experiments 
depends on the target size in base pairs (i.e., the number of bases 
printed on the chip), the desired depth of coverage and the 
availability of barcodes. The number of single barcodes 
commercially available is currently 96, but by using eight 
double barcodes this can be increased to 768. This theoretically 
allows the parallel sequencing of thousands of loci in hundreds 
of individuals at the same time. One advantage of exon capture 
for non-model organisms is that the sequences for the array are 
obtained directly from a transcriptome and do not require a
previously sequenced genome. Moreover, the protocol is 
suitable for historical museum samples, since it explicitly 
requires randomly fragmented DNA, which is often the natural 
state for museum-derived material. 
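
The multiplexing limits described above can be made concrete with a small calculation. The first function enumerates the 96 x 8 = 768 double-barcode combinations mentioned in the text. The second estimates how many individuals fit on one run from the target size and desired depth. The run output and the on-target fraction are hypothetical parameters introduced here for illustration, and real capture experiments recover far fewer than 100% on-target reads.

```python
from itertools import product


def barcode_combinations(n_first=96, n_second=8):
    """All (first, second) index pairs available for double barcoding."""
    return list(product(range(n_first), range(n_second)))


def max_multiplexed(run_output_bp, target_bp, depth, n_barcodes, on_target=1.0):
    """Individuals that can share one run, limited by yield and by barcodes.

    run_output_bp : total bases produced by the run (assumed value)
    target_bp     : size of the captured target in base pairs
    depth         : desired mean depth of coverage per individual
    on_target     : assumed fraction of reads mapping to the target
    """
    by_yield = int(run_output_bp * on_target // (target_bp * depth))
    return min(by_yield, n_barcodes)


print(len(barcode_combinations()))
# Hypothetical run: 10 Gbp of output, 1 Mbp target, 50X depth.
print(max_multiplexed(10_000_000_000, 1_000_000, 50, n_barcodes=768))
```

In this invented example sequencing yield, rather than barcode availability, is the limiting factor; with a smaller target or a larger run the barcode ceiling would apply instead.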

There are two main limitations to the Exon Capture 
approach. The first is that starting costs are high (reagents 
and specialized equipment not common in most laboratories), 
though they can be minimized by sharing among research 
groups. Second, as in most NGS applications, sophisticated 
expertise in bioinformatics is needed to manage the large and 
complex data sets. Fortunately, user-friendly programs and 
tools are becoming increasingly available for post-NGS 
processing and analyses. Since exon capture targets exons, 
most of the captured variation will correspond to synonymous 
mutations in coding genes, allowing insights into population 
variability. However, because genomic DNA is captured, 
some of the non-coding flanking regions (e.g., untranslated 
regions, introns) will also be recovered. 


7. CONCLUSION 

Arachnids have a rich history of molecular studies focusing 
on many aspects of their biology. To date, few of these have 
made use of recent advances in sequencing technology, but, as 
we have outlined above, many future projects should benefit 
from the use of next-generation sequencing platforms. These 
technologies are diverse in their methods and applications, 
and promising advances are on the horizon. However, it is 
important to realize the strengths and weaknesses of NGS tools
and to embrace traditional techniques when more appropriate. 

Although it is easy to be seduced by the amount of data that 
can be generated by sequencing an entire genome, this is often 
not necessary. In many cases, studies using transcriptomes 
or reduced representation techniques can collect incredible 
amounts of useful data to address any number of questions. 
Regardless of the study, the number of potential avenues to 
gather molecular data is large in terms of strategy and scale. 
As arachnologists continue to amass novel data from diverse 
lineages, our ability to identify loci, in terms of function and 
homology, will increase and open more research opportunities.
The unique biology and evolutionary history of arachnids,
coupled with technological and bioinformatic advances,
will provide research opportunities for years to come. 